Apple Silicon
Production-Grade Local LLM Inference on Apple Silicon: A Comparative Study of MLX, MLC-LLM, Ollama, llama.cpp, and PyTorch MPS
Rajesh, Varun, Jodhpurkar, Om, Anbuselvan, Pooja, Singh, Mantinder, Jallepali, Ashok, Godbole, Shantanu, Sharma, Pradeep Kumar, Shrivastava, Hritvik
We present a systematic, empirical evaluation of five local large language model (LLM) runtimes on Apple Silicon: MLX, MLC-LLM, llama.cpp, Ollama, and PyTorch MPS. Experiments were conducted on a Mac Studio equipped with an M2 Ultra processor and 192 GB of unified memory. Using the Qwen-2.5 model family across prompts ranging from a few hundred to 100,000 tokens, we measure time-to-first-token (TTFT), steady-state throughput, latency percentiles, long-context behavior (key-value and prompt caching), quantization support, streaming performance, batching and concurrency behavior, and deployment complexity. Under our settings, MLX achieves the highest sustained generation throughput, while MLC-LLM delivers consistently lower TTFT for moderate prompt sizes and offers stronger out-of-the-box inference features. llama.cpp is highly efficient for lightweight single-stream use, Ollama emphasizes developer ergonomics but lags in throughput and TTFT, and PyTorch MPS remains limited by memory constraints on large models and long contexts. All frameworks execute fully on-device with no telemetry, ensuring strong privacy guarantees. We release scripts, logs, and plots to reproduce all results. Our analysis clarifies the design trade-offs in Apple-centric LLM deployments and provides evidence-based recommendations for interactive and long-context processing. Although Apple Silicon inference frameworks still trail NVIDIA GPU-based systems such as vLLM in absolute performance, they are rapidly maturing into viable, production-grade solutions for private, on-device LLM inference.
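The TTFT/throughput distinction the abstract draws can be sketched as a tiny measurement harness. This is a minimal sketch, not the paper's actual methodology: the generator below is a stand-in for any runtime's streaming API (the timings and the `fake_stream` helper are illustrative assumptions).

```python
import time
from typing import Iterable, Tuple

def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Time-to-first-token (TTFT) and decode throughput (tok/s)
    for any streaming token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = time.perf_counter()  # prefill ends here
        count += 1
    end = time.perf_counter()
    ttft = float("nan") if first is None else first - start
    # Throughput is measured over the decode phase only (after the
    # first token), mirroring the prefill/generation split above.
    if count > 1 and end > first:
        tps = (count - 1) / (end - first)
    else:
        tps = float("nan")
    return ttft, tps, count

# Stand-in generator; in practice this would wrap a runtime's
# streaming call (function names vary per framework and version).
def fake_stream(n=50, prefill=0.02, per_token=0.005):
    time.sleep(prefill)        # simulated prompt processing
    for i in range(n):
        time.sleep(per_token)  # simulated per-token decode
        yield f"tok{i}"

ttft, tps, n = measure_stream(fake_stream())
print(f"TTFT={ttft*1000:.1f} ms  throughput={tps:.1f} tok/s  ({n} tokens)")
```

The same `measure_stream` function can be pointed at each framework's token stream, which is what makes the TTFT and throughput numbers comparable across runtimes.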
Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
Chen, Mu-Chi, Huang, Po-Hsuan, Ke, Xiangrui, Tu, Chia-Heng, Xue, Chun Jason, Hung, Shih-Hao
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI), with significant advances such as OpenAI's ChatGPT, Meta's Llama, and Databricks' DBRX. This paper addresses the cost and scalability challenges of constructing private LLM systems for personal or small-group services, as envisioned by Apple Intelligence. A Mac Studio cluster with Apple's M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model, which uses the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model's experts across two to four machine nodes significantly reduces inference time. We find that the computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the Apple software stack's memory management logic. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than a state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations; the model provides valuable insights for designing private LLM systems.
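The latency-versus-bandwidth point above follows from a standard alpha-beta communication model: per-token expert outputs are small, so the fixed network latency dominates the transfer time. A minimal sketch, with illustrative numbers that are assumptions rather than the paper's measurements:

```python
def exchange_time_us(payload_bytes: int, latency_us: float,
                     bandwidth_gbps: float) -> float:
    """Alpha-beta cost model for one expert-output exchange:
    total time = fixed network latency + payload / bandwidth."""
    transfer_us = payload_bytes * 8 / (bandwidth_gbps * 1e3)  # Gbit/s -> bit/us
    return latency_us + transfer_us

# Illustrative assumptions: one token's fp16 hidden-state vector of
# size 6144, 10 GbE bandwidth, 200 us round-trip network latency.
payload = 6144 * 2                      # bytes
total = exchange_time_us(payload, latency_us=200.0, bandwidth_gbps=10.0)
transfer = total - 200.0
print(f"transfer={transfer:.1f} us vs latency=200.0 us "
      f"({transfer / total:.0%} of the exchange)")
```

Under these assumed numbers the wire transfer is under 10 microseconds while latency contributes hundreds, which is why reducing round trips matters more than adding bandwidth.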
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
Gu, Yile, Kadekodi, Rohan, Nguyen, Hoang, Kamahori, Keisuke, Liu, Yiyu, Kasikci, Baris
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
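The application-level metrics ConsumerBench captures, latency percentiles and SLO attainment, reduce to simple arithmetic over per-request latency samples. A minimal sketch (the sample values and the 150 ms SLO are illustrative assumptions, not ConsumerBench's API):

```python
def slo_attainment(latencies_ms, slo_ms):
    """Fraction of requests meeting the latency SLO, plus nearest-rank
    p50/p95 percentiles of the sample."""
    s = sorted(latencies_ms)
    met = sum(1 for x in s if x <= slo_ms)

    def pct(p):
        # nearest-rank index into the sorted sample
        k = int(round(p / 100 * (len(s) - 1)))
        return s[min(max(k, 0), len(s) - 1)]

    return met / len(s), pct(50), pct(95)

# Illustrative latency sample (ms) for one app under contention:
lat = [80, 95, 110, 120, 300, 90, 85, 105, 98, 102]
rate, p50, p95 = slo_attainment(lat, slo_ms=150)
print(f"SLO attainment={rate:.0%}  p50={p50} ms  p95={p95} ms")
```

Note how one straggler (300 ms) barely moves the median but dominates p95, which is why SLO-aware scheduling targets tail latency rather than averages.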
Fine-tuning LLaMA 2 inference: a comparative study of language implementations for optimal efficiency
Hossain, Sazzad, Seyam, Touhidul Alam, Chowdhury, Avijit, Xamidov, Munis, Ghose, Rajib, Pathak, Abhijit
This paper conducts a comparative investigation to maximize the effectiveness of Llama2 inference, a critical task in machine learning and natural language processing (NLP). Various programming languages and frameworks, including TensorFlow, PyTorch, Python, Mojo, C++, and Java, are examined, assessing their speed, memory consumption, and ease of implementation through extensive testing and benchmarking. The advantages and disadvantages of each strategy are noted, with suggested optimization methods for parallel processing and hardware utilization. Additionally, the performance of the Mojo SDK, a novel framework designed for LLM inference on Apple Silicon, is investigated, comparing it against established implementations in C, C++, Rust, Zig, Go, and Julia. Through comprehensive benchmarking on an Apple M1 Max, Mojo SDK's competitive performance and its advantages in ease of use and Python compatibility are demonstrated, suggesting it is a compelling alternative for LLM inference on Apple Silicon. Implications for the future of LLM deployment on resource-limited hardware and potential avenues for further research are discussed.
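Comparing implementations across languages fairly hinges on a consistent timing harness: warm up first (JIT compilation and allocation skew early runs), then report a robust statistic such as the median. A minimal sketch of such a harness; `dummy_generate` is a hypothetical stand-in for any one implementation's forward pass:

```python
import time
import statistics

def bench_tokens_per_sec(fn, tokens_per_call, warmup=2, iters=5):
    """Median tokens/s for a generation callable. Warmup runs are
    discarded so one-time costs (JIT, allocation) don't skew results."""
    for _ in range(warmup):
        fn()
    rates = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        rates.append(tokens_per_call / (time.perf_counter() - t0))
    return statistics.median(rates)

# Hypothetical stand-in for one implementation; a real harness would
# invoke each runtime (C, Mojo, Rust, Zig, ...) through the same wrapper.
def dummy_generate(n_tokens=1000):
    acc = 0
    for i in range(n_tokens):
        acc += i * i  # placeholder for per-token compute
    return acc

rate = bench_tokens_per_sec(dummy_generate, tokens_per_call=1000)
print(f"median throughput: {rate:,.0f} tok/s")
```

Wrapping every implementation behind the same callable interface is what makes the cross-language tokens/s numbers directly comparable.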
The Mac turns 40: How Apple Silicon cured its midlife crisis
The Mac, formerly the more austere Macintosh, turns 40 today, putting Apple's longest-running product squarely in middle age. But like someone who sees the back half of their life approaching and gets in marathon-runner shape, the Mac is in the strongest place it's been for decades. From a revenue perspective, Mac sales declined precipitously in 2023, but that came on the heels of four years of growth that was likely the product of pent-up demand for an improved Mac lineup. In 2020, Apple finally started delivering on that, thanks in large part to Apple Silicon arriving in the Mac, ushering in the era we're in now. While the Mac was on shaky ground prior to Apple Silicon, it would now be pretty silly to suggest the Mac won't make it to its 50th birthday.
Get 'ducking' excited: Apple is finally addressing this annoying autocorrect issue
Apple users who are tired of that "ducking" autocorrect issue can now rejoice! The tech company announced Monday at this year's Worldwide Developers Conference that iOS 17 will ensure that autocorrected words are temporarily underlined so users know what has been changed and can quickly change the word back to what they originally meant to type. "Autocorrect is powered by on-device machine learning and over the years, we've continued to advance these models," said Craig Federighi, the company's software chief. "The keyboard now leverages a transformer language model, which is state of the art for word prediction, making autocorrect more accurate than ever." The autocorrect feature has been the subject of tweets, memes and other social media posts for years, often annoying already irritated people trying to drop a popular expletive by changing the word to "ducking."
Stable Diffusion with Core ML on Apple Silicon - Apple Machine Learning Research
Today, we are excited to release optimizations to Core ML for Stable Diffusion in macOS 13.1 and iOS 16.2, along with code to get started with deploying to Apple Silicon devices. Since its public debut in August 2022, Stable Diffusion has been adopted by a vibrant community of artists, developers and hobbyists alike, enabling the creation of unprecedented visual content with as little as a text prompt. In response, the community has built an expansive ecosystem of extensions and tools around this core technology in a matter of weeks. There are already methods that personalize Stable Diffusion, extend it to languages other than English, and more, thanks to open-source projects like Hugging Face diffusers. Beyond image generation from text prompts, developers are also discovering other creative uses for Stable Diffusion, such as image editing, in-painting, out-painting, super-resolution, style transfer and even color palette generation.
Apple's chips are on the table
Apple's transition to its own processors is nearly complete. The company's recent spring event saw the debut of the Mac Studio and its M1 Ultra processor -- its most powerful piece of silicon yet. But it also revealed what the future of Apple's computers could look like. For the first time, all of Apple's chips are on the table. The first crucial takeaway is that Apple is now a force to be reckoned with when it comes to chips (if it wasn't already).
Apple Macs with custom Arm-based silicon chips unveiled today
Apple is expected to unveil the first Mac computers powered by its own custom Arm-based processor at its 'One More Thing' event tonight. The event will be livestreamed from the Apple headquarters in Cupertino, California from 18:00 GMT (13:00 ET) on Tuesday, November 10. This marks the first time in the Mac's 36-year history that the line will be powered by an Apple-designed processor, which is said to offer better performance, higher bandwidth and consume less power than the Intel-based machines currently in use. Apple is expected to start shipping the first Arm Macs before the end of the year, with all of its devices adopting the new system within two years. Apple has officially announced its upcoming November 10 event that is set to reveal the tech giant's first Arm-based Macs.